Spaced Seed Data Structures for De Novo Assembly

Authors

Inanç Birol

Justin Chu

Hamid Mohamadi

Shaun D. Jackman

Karthika Raghavan

Benjamin P. Vandervalk

Anthony Raymond

René L. Warren

Abstract

De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Although longer subsequences unlock longer genomic features for assembly, the associated increase in compute resources limits the practicality of DBG compared with other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates the two subsequences of a seed pair, sequence specificity increases with gap length. Further, we note that Bloom filters would be suitable for implicitly storing spaced seeds and would tolerate sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.
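The idea of storing paired-subsequence spaced seeds in a Bloom filter can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `spaced_seeds`, `seed_key`, and `BloomFilter`, the BLAKE2b-based hashing, and the parameters `k`, `gap`, `num_bits`, and `num_hashes` are all assumptions chosen for illustration.

```python
import hashlib


def spaced_seeds(read, k, gap):
    """Yield (left, right) k-mer pairs separated by a fixed gap of `gap` bases."""
    span = 2 * k + gap  # total window covered by one seed pair
    for i in range(len(read) - span + 1):
        left = read[i:i + k]
        right = read[i + k + gap:i + span]
        yield left, right


def seed_key(left, right):
    """Hypothetical string key for a seed pair; the gap bases are not part of it."""
    return left + "|" + right


class BloomFilter:
    """Plain Bloom filter: set membership with false positives, no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive independent hash values by salting BLAKE2b (an illustrative choice).
        for n in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=n.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Usage: index every spaced seed of one read.
read = "ACGTACGTTGCAAGCT"
seeds = list(spaced_seeds(read, k=4, gap=3))
bloom = BloomFilter(num_bits=1 << 16, num_hashes=3)
for left, right in seeds:
    bloom.add(seed_key(left, right))
```

In this sketch the bases inside the gap never enter the key, so a substitution that falls in the gap of a given window leaves that window's seed unchanged. A frequency-tracking variant could replace the bit array with small counters, which is likewise an assumption rather than the design described in the paper.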


Similar resources

Clustering of Short Read Sequences for de novo Transcriptome Assembly

Given the importance of transcriptome analysis in various biological studies and considering the vast amount of whole transcriptome sequencing data, it seems necessary to develop an algorithm to assemble transcriptome data. In this study we propose an algorithm for transcriptome assembly in the absence of a reference genome. First, the contiguous sequences are generated using de Bruijn graph with d...


Succinct data structures for assembling large genomes

MOTIVATION Second-generation sequencing technology makes it feasible for many researchers to obtain enough sequence reads to attempt the de novo assembly of higher eukaryotes (including mammals). De novo assembly not only provides a tool for understanding wide scale biological variation, but within human biomedicine, it offers a direct way of observing both large-scale structural variation and f...


Evaluating Biological Sequences

Title of dissertation: SEARCHING, CLUSTERING AND EVALUATING BIOLOGICAL SEQUENCES Mohammadreza Ghodsi, Doctor of Philosophy, 2012 Dissertation directed by: Professor Mihai Pop Department of Computer Science The latest generation of biological sequencing technologies have made it possible to generate sequence data faster and cheaper than ever before. The growth of sequence data has been exponenti...


On solving possibilistic multi-objective De Novo linear programming

Multi-objective De Novo linear programming (MODNLP) is the problem of designing an optimal system by reshaping the feasible set (Fiala [3]). This paper deals with MODNLP having possibilistic objective function coefficients. The problem is considered by inserting possibilistic data in the objective function coefficients. The solution of the problem is defined and established under the using of effi...


De novo Genome Assembly and Single Nucleotide Variations for Soybean Mosaic Virus Using Soybean Seed Transcriptome Data

Soybean is the most important legume crop in the world. Several diseases in soybean lead to serious yield losses in major soybean-producing countries. Moreover, soybean can be infected by diverse viruses. Recently, we carried out a large-scale screening to identify viruses infecting soybean using available soybean transcriptome data. Of the screened transcriptomes, a soybean transcriptome for s...



Volume: 2015

Publication date: 2015
